{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Build a Simple Retrieval-Augmented Generation (RAG) Pipeline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this use case, we show how to build and evaluate a simple RAG pipeline with LightRAG. RAG (Retrieval-Augmented Generation) pipelines leverage a retriever to fetch relevant context from a knowledge base (e.g., a document database) which is then fed to an LLM generator with the query to produce the answer. This allows the model to generate more contextually relevant answers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import needed modules, including modules for loading datasets, constructing a RAG pipeline, and evaluating the performance of the RAG pipeline.\n",
    "import yaml\n",
    "from typing import Any, List, Optional, Union\n",
    "\n",
    "from datasets import load_dataset\n",
    "\n",
    "from lightrag.core.types import Document\n",
    "from lightrag.core.component import Component, Sequential\n",
    "from lightrag.core.embedder import Embedder\n",
    "from lightrag.core.document_splitter import DocumentSplitter\n",
    "from lightrag.core.data_components import (\n",
    "    RetrieverOutputToContextStr,\n",
    "    ToEmbeddings,\n",
    ")\n",
    "from lightrag.components.retriever import FAISSRetriever\n",
    "from lightrag.core.generator import Generator\n",
    "from lightrag.core.db import LocalDocumentDB\n",
    "from lightrag.core.string_parser import JsonParser\n",
    "\n",
    "from lightrag.eval import (\n",
    "    AnswerMatchAcc,\n",
    "    RetrieverRecall,\n",
    "    RetrieverRelevance,\n",
    "    LLMasJudge,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "False"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Here, we use the OpenAIClient in the Generator as an example, but you can use any other clients (with the corresponding API Key as needed)\n",
    "from lightrag.components.model_client import OpenAIClient\n",
    "import os\n",
    "os.environ[\"KMP_DUPLICATE_LIB_OK\"] = \"True\"\n",
    "import dotenv\n",
    "# load evironment\n",
    "dotenv.load_dotenv(dotenv_path=\".env\", override=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Define the configuration for the RAG pipeline**. We load the configuration from a YAML file. This configuration specifies the components of the RAG pipeline, including the text_splitter, vectorizer, retriever, and generator."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'vectorizer': {'batch_size': 100, 'model_kwargs': {'model': 'text-embedding-3-small', 'dimensions': 256, 'encoding_format': 'float'}}, 'retriever': {'top_k': 2}, 'generator': {'model': 'gpt-3.5-turbo', 'temperature': 0.3, 'stream': False}, 'text_splitter': {'split_by': 'sentence', 'chunk_size': 1, 'chunk_overlap': 0}}\n"
     ]
    }
   ],
   "source": [
    "# Define the configuration settings for the RAG pipeline.\n",
    "with open(\"./simple_rag.yaml\", \"r\") as file:\n",
    "    settings = yaml.safe_load(file)\n",
    "print(settings)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Load a dataset**. Here, We use the [HotpotQA](https://huggingface.co/datasets/hotpotqa/hotpot_qa) dataset as an example. Each data sample in HotpotQA has *question*, *answer*, *context* and *supporting_facts* selected from the whole context."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "example: {'id': '5a7a06935542990198eaf050', 'question': \"Which magazine was started first Arthur's Magazine or First for Women?\", 'answer': \"Arthur's Magazine\", 'type': 'comparison', 'level': 'medium', 'supporting_facts': {'title': [\"Arthur's Magazine\", 'First for Women'], 'sent_id': [0, 0]}, 'context': {'title': ['Radio City (Indian radio station)', 'History of Albanian football', 'Echosmith', \"Women's colleges in the Southern United States\", 'First Arthur County Courthouse and Jail', \"Arthur's Magazine\", '2014–15 Ukrainian Hockey Championship', 'First for Women', 'Freeway Complex Fire', 'William Rast'], 'sentences': [[\"Radio City is India's first private FM radio station and was started on 3 July 2001.\", ' It broadcasts on 91.1 (earlier 91.0 in most cities) megahertz from Mumbai (where it was started in 2004), Bengaluru (started first in 2001), Lucknow and New Delhi (since 2003).', ' It plays Hindi, English and regional songs.', ' It was launched in Hyderabad in March 2006, in Chennai on 7 July 2006 and in Visakhapatnam October 2007.', ' Radio City recently forayed into New Media in May 2008 with the launch of a music portal - PlanetRadiocity.com that offers music related news, videos, songs, and other music-related features.', ' The Radio station currently plays a mix of Hindi and Regional music.', ' Abraham Thomas is the CEO of the company.'], ['Football in Albania existed before the Albanian Football Federation (FSHF) was created.', \" This was evidenced by the team's registration at the Balkan Cup tournament during 1929-1931, which started in 1929 (although Albania eventually had pressure from the teams because of competition, competition started first and was strong enough in the duels) .\", ' Albanian National Team was founded on June 6, 1930, but Albania had to wait 16 years to play its first international match and then defeated Yugoslavia in 1946.', ' In 1932, Albania joined FIFA (during the 12–16 June convention ) And in 1954 she was one of the founding members of UEFA.'], ['Echosmith is an American, Corporate indie pop band formed in February 2009 in Chino, California.', ' Originally formed as a quartet of siblings, the band currently consists of Sydney, Noah and Graham Sierota, following the departure of eldest sibling Jamie in late 2016.', ' Echosmith started first as \"Ready Set Go!\"', ' until they signed to Warner Bros.', ' Records in May 2012.', ' They are best known for their hit song \"Cool Kids\", which reached number 13 on the \"Billboard\" Hot 100 and was certified double platinum by the RIAA with over 1,200,000 sales in the United States and also double platinum by ARIA in Australia.', ' The song was Warner Bros.', \" Records' fifth-biggest-selling-digital song of 2014, with 1.3 million downloads sold.\", ' The band\\'s debut album, \"Talking Dreams\", was released on October 8, 2013.'], [\"Women's colleges in the Southern United States refers to undergraduate, bachelor's degree–granting institutions, often liberal arts colleges, whose student populations consist exclusively or almost exclusively of women, located in the Southern United States.\", \" Many started first as girls' seminaries or academies.\", ' Salem College is the oldest female educational institution in the South and Wesleyan College is the first that was established specifically as a college for women.', ' Some schools, such as Mary Baldwin University and Salem College, offer coeducational courses at the graduate level.'], ['The First Arthur County Courthouse and Jail, was perhaps the smallest court house in the United States, and serves now as a museum.'], [\"Arthur's Magazine (1844–1846) was an American literary periodical published in Philadelphia in the 19th century.\", ' Edited by T.S. Arthur, it featured work by Edgar A. Poe, J.H. Ingraham, Sarah Josepha Hale, Thomas G. Spear, and others.', ' In May 1846 it was merged into \"Godey\\'s Lady\\'s Book\".'], ['The 2014–15 Ukrainian Hockey Championship was the 23rd season of the Ukrainian Hockey Championship.', ' Only four teams participated in the league this season, because of the instability in Ukraine and that most of the clubs had economical issues.', ' Generals Kiev was the only team that participated in the league the previous season, and the season started first after the year-end of 2014.', ' The regular season included just 12 rounds, where all the teams went to the semifinals.', ' In the final, ATEK Kiev defeated the regular season winner HK Kremenchuk.'], [\"First for Women is a woman's magazine published by Bauer Media Group in the USA.\", ' The magazine was started in 1989.', ' It is based in Englewood Cliffs, New Jersey.', ' In 2011 the circulation of the magazine was 1,310,696 copies.'], ['The Freeway Complex Fire was a 2008 wildfire in the Santa Ana Canyon area of Orange County, California.', ' The fire started as two separate fires on November 15, 2008.', ' The \"Freeway Fire\" started first shortly after 9am with the \"Landfill Fire\" igniting approximately 2 hours later.', ' These two separate fires merged a day later and ultimately destroyed 314 residences in Anaheim Hills and Yorba Linda.'], ['William Rast is an American clothing line founded by Justin Timberlake and Trace Ayala.', ' It is most known for their premium jeans.', ' On October 17, 2006, Justin Timberlake and Trace Ayala put on their first fashion show to launch their new William Rast clothing line.', ' The label also produces other clothing items such as jackets and tops.', ' The company started first as a denim line, later evolving into a men’s and women’s clothing line.']]}}\n",
      "ground truth context: {'title': [\"Arthur's Magazine\", 'First for Women'], 'sent_id': [0, 0]}\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/mengliu/Library/Caches/pypoetry/virtualenvs/lightrag-project-OrKUABKc-py3.12/lib/python3.12/site-packages/datasets/table.py:1421: FutureWarning: promote has been superseded by promote_options='default'.\n",
      "  table = cls._concat_blocks(blocks, axis=0)\n"
     ]
    }
   ],
   "source": [
    "# Load the HotpotQA dataset. We select a subset of the dataset for demonstration purposes.\n",
    "dataset = load_dataset(path=\"hotpot_qa\", name=\"fullwiki\")\n",
    "selected_dataset = dataset[\"train\"].select(range(5))\n",
    "print(f\"example: {selected_dataset[0]}\")\n",
    "print(f\"ground truth context: {selected_dataset[0]['supporting_facts']}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Define a simple RAG pipeline**. Define a RAG pipeline by specifying the key components, such as *vectorizer*, *retriever*, and *generator*. For more information on these components, refer to the developer notes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "# The defined RAG pipeline.\n",
    "class RAG(Component):\n",
    "\n",
    "    def __init__(self, settings: dict):\n",
    "        super().__init__()\n",
    "        self.vectorizer_settings = settings[\"vectorizer\"]\n",
    "        self.retriever_settings = settings[\"retriever\"]\n",
    "        self.generator_model_kwargs = settings[\"generator\"]\n",
    "        self.text_splitter_settings = settings[\"text_splitter\"]\n",
    "\n",
    "        vectorizer = Embedder(\n",
    "            model_client=OpenAIClient(),\n",
    "            model_kwargs=self.vectorizer_settings[\"model_kwargs\"],\n",
    "        )\n",
    "\n",
    "        text_splitter = DocumentSplitter(\n",
    "            split_by=self.text_splitter_settings[\"split_by\"],\n",
    "            split_length=self.text_splitter_settings[\"chunk_size\"],\n",
    "            split_overlap=self.text_splitter_settings[\"chunk_overlap\"],\n",
    "        )\n",
    "        self.data_transformer = Sequential(\n",
    "            text_splitter,\n",
    "            ToEmbeddings(\n",
    "                vectorizer=vectorizer,\n",
    "                batch_size=self.vectorizer_settings[\"batch_size\"],\n",
    "            ),\n",
    "        )\n",
    "        self.data_transformer_key = self.data_transformer._get_name()\n",
    "        # initialize retriever, which depends on the vectorizer too\n",
    "        self.retriever = FAISSRetriever(\n",
    "            top_k=self.retriever_settings[\"top_k\"],\n",
    "            dimensions=self.vectorizer_settings[\"model_kwargs\"][\"dimensions\"],\n",
    "            vectorizer=vectorizer,\n",
    "        )\n",
    "        self.retriever_output_processors = RetrieverOutputToContextStr(deduplicate=True)\n",
    "\n",
    "        self.db = LocalDocumentDB()\n",
    "\n",
    "        # initialize generator\n",
    "        self.generator = Generator(\n",
    "            preset_prompt_kwargs={\n",
    "                \"task_desc_str\": r\"\"\"\n",
    "                    You are a helpful assistant.\n",
    "\n",
    "                    Your task is to answer the query that may or may not come with context information.\n",
    "                    When context is provided, you should stick to the context and less on your prior knowledge to answer the query.\n",
    "\n",
    "                    Output JSON format:\n",
    "                    {\n",
    "                        \"answer\": \"The answer to the query\",\n",
    "                    }\"\"\"\n",
    "            },\n",
    "            model_client=OpenAIClient(),\n",
    "            model_kwargs=self.generator_model_kwargs,\n",
    "            output_processors=JsonParser(),\n",
    "        )\n",
    "        self.tracking = {\"vectorizer\": {\"num_calls\": 0, \"num_tokens\": 0}}\n",
    "\n",
    "    def build_index(self, documents: List[Document]):\n",
    "        self.db.load_documents(documents)\n",
    "        self.map_key = self.db.map_data()\n",
    "        print(f\"map_key: {self.map_key}\")\n",
    "        self.data_key = self.db.transform_data(self.data_transformer)\n",
    "        print(f\"data_key: {self.data_key}\")\n",
    "        self.transformed_documents = self.db.get_transformed_data(self.data_key)\n",
    "        self.retriever.build_index_from_documents(self.transformed_documents)\n",
    "\n",
    "    def generate(self, query: str, context: Optional[str] = None) -> Any:\n",
    "        if not self.generator:\n",
    "            raise ValueError(\"Generator is not set\")\n",
    "\n",
    "        prompt_kwargs = {\n",
    "            \"context_str\": context,\n",
    "            \"input_str\": query,\n",
    "        }\n",
    "        response = self.generator(prompt_kwargs=prompt_kwargs)\n",
    "        if response.error:\n",
    "            raise ValueError(f\"Error in generator: {response.error}\")\n",
    "        return response.data\n",
    "\n",
    "    def call(self, query: str) -> Any:\n",
    "        retrieved_documents = self.retriever(query)\n",
    "        # fill in the document\n",
    "        for i, retriever_output in enumerate(retrieved_documents):\n",
    "            retrieved_documents[i].documents = [\n",
    "                self.transformed_documents[doc_index]\n",
    "                for doc_index in retriever_output.doc_indexes\n",
    "            ]\n",
    "        # convert all the documents to context string\n",
    "        context_str = self.retriever_output_processors(retrieved_documents)\n",
    "\n",
    "        return self.generate(query, context=context_str), context_str"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To run the RAG piepline for each example in the dataset, we need to first **build the index** and then **call the pipeline**. For each sample in the dataset, we create a list of documents to retrieve from, according to its corresponding *context* in the dataset. Each document has a title and a list of sentences. We use the `Document` class from `lightrag.core.types` to represent each document."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# To get the ground truth context string from the supporting_facts filed in HotpotQA. This function is specific to the HotpotQA dataset.\n",
    "def get_supporting_sentences(\n",
    "    supporting_facts: dict[str, list[Union[str, int]]], context: dict[str, list[str]]\n",
    ") -> List[str]:\n",
    "    \"\"\"\n",
    "    Extract the supporting sentences from the context based on the supporting facts.\n",
    "    \"\"\"\n",
    "    extracted_sentences = []\n",
    "    for title, sent_id in zip(supporting_facts[\"title\"], supporting_facts[\"sent_id\"]):\n",
    "        if title in context[\"title\"]:\n",
    "            index = context[\"title\"].index(title)\n",
    "            sentence = context[\"sentences\"][index][sent_id]\n",
    "            extracted_sentences.append(sentence)\n",
    "    return extracted_sentences\n",
    "\n",
    "\n",
    "questions = []\n",
    "retrieved_contexts = []\n",
    "gt_contexts = []\n",
    "pred_answers = []\n",
    "gt_answers = []\n",
    "for data in selected_dataset:\n",
    "    # build the document list\n",
    "    num_docs = len(data[\"context\"][\"title\"])\n",
    "    doc_list = [\n",
    "        Document(\n",
    "            meta_data={\"title\": data[\"context\"][\"title\"][i]},\n",
    "            text=\" \".join(data[\"context\"][\"sentences\"][i]),\n",
    "        )\n",
    "        for i in range(num_docs)\n",
    "    ]\n",
    "    rag = RAG(settings)\n",
    "    # build the index\n",
    "    rag.build_index(doc_list)\n",
    "    # call the pipeline\n",
    "    query = data[\"question\"]\n",
    "    response, context_str = rag.call(query)\n",
    "    gt_context_sentence_list = get_supporting_sentences(\n",
    "        data[\"supporting_facts\"], data[\"context\"]\n",
    "    )\n",
    "    questions.append(query)\n",
    "    retrieved_contexts.append(context_str)\n",
    "    gt_contexts.append(gt_context_sentence_list)\n",
    "    pred_answers.append(response[\"answer\"])\n",
    "    gt_answers.append(data[\"answer\"])\n",
    "    print(f\"query: {query}\")\n",
    "    print(f\"response: {response['answer']}\")\n",
    "    print(f\"ground truth response: {data['answer']}\")\n",
    "    print(f\"context_str: {context_str}\")\n",
    "    print(f\"ground truth context_str: {gt_context_sentence_list}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Evaluate the performance of the RAG pipeline**. We first evaluate the performance of the retriever component by calculating the *recall* of the retrieved context and the *relevance* score of the retrieved context."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "ename": "NameError",
     "evalue": "name 'retrieved_contexts' is not defined",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mNameError\u001b[0m                                 Traceback (most recent call last)",
      "Cell \u001b[0;32mIn[7], line 3\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[38;5;66;03m# Compute the recall.\u001b[39;00m\n\u001b[1;32m      2\u001b[0m retriever_recall \u001b[38;5;241m=\u001b[39m RetrieverRecall()\n\u001b[0;32m----> 3\u001b[0m avg_recall, recall_list \u001b[38;5;241m=\u001b[39m retriever_recall\u001b[38;5;241m.\u001b[39mcompute(\u001b[43mretrieved_contexts\u001b[49m, gt_contexts)\n\u001b[1;32m      4\u001b[0m \u001b[38;5;28mprint\u001b[39m(\u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124maverage recall: \u001b[39m\u001b[38;5;132;01m{\u001b[39;00mavg_recall\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m\"\u001b[39m)\n\u001b[1;32m      5\u001b[0m \u001b[38;5;28mprint\u001b[39m(\u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mrecall list: \u001b[39m\u001b[38;5;132;01m{\u001b[39;00mrecall_list\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m\"\u001b[39m)\n",
      "\u001b[0;31mNameError\u001b[0m: name 'retrieved_contexts' is not defined"
     ]
    }
   ],
   "source": [
    "# Compute the recall.\n",
    "retriever_recall = RetrieverRecall()\n",
    "avg_recall, recall_list = retriever_recall.compute(retrieved_contexts, gt_contexts)\n",
    "print(f\"average recall: {avg_recall}\")\n",
    "print(f\"recall list: {recall_list}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Compute the relevance.\n",
    "retriever_relevance = RetrieverRelevance()\n",
    "avg_relevance, relevance_list = retriever_relevance.compute(\n",
    "    retrieved_contexts, gt_contexts\n",
    ")\n",
    "print(f\"average relevance: {avg_relevance}\")\n",
    "print(f\"relevance list: {relevance_list}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we evaluate the generated answers using the AnswerMatchAcc metric, which compares the predicted answer with the ground truth answer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Compute the answer match accuracy.\n",
    "answer_match_acc = AnswerMatchAcc(type=\"exact_match\")\n",
    "avg_acc, acc_list = answer_match_acc.compute(pred_answers, gt_answers)\n",
    "print(f\"average accuracy: {avg_acc}\")\n",
    "print(f\"accuracy list: {acc_list}\")\n",
    "answer_match_acc = AnswerMatchAcc(type=\"fuzzy_match\")\n",
    "avg_acc, acc_list = answer_match_acc.compute(pred_answers, gt_answers)\n",
    "print(f\"average accuracy: {avg_acc}\")\n",
    "print(f\"accuracy list: {acc_list}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We finally use an LLM as the judge for evaluating the performance. The task description in the `DEFAULT_LLM_EVALUATOR_PROMPT` is \"You are a helpful assistant. Given the question, ground truth answer, and predicted answer, you need to answer the judgement query. Output True or False according to the judgement query.\" You can customize the task description as needed. See the `lightrag.eval.LLMasJudge` class for more details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "llm_judge = LLMasJudge()\n",
    "judgement_query = (\n",
    "        \"For the question, does the predicted answer contain the ground truth answer?\"\n",
    "    )\n",
    "avg_judgement, judgement_list = llm_judge.compute(\n",
    "    questions, gt_answers, pred_answers, judgement_query\n",
    ")\n",
    "print(f\"average judgement: {avg_judgement}\")\n",
    "print(f\"judgement list: {judgement_list}\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "lightrag-project",
   "language": "python",
   "name": "light-rag-project"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
